Applications

Environment

In this section, we will look at a few more examples of Chinese text processing based on the data set demo_data/song-jay-amei-v1.csv. It is a text collection of songs by Jay Chou and Amei.

Loading Data

## loading
corp_df_text <- read_csv(file = "demo_data/song-jay-amei-v1.csv", 
                         locale = locale(encoding = "UTF-8"))

## creating doc_id
corp_df_text <- corp_df_text %>%
  mutate(doc_id = row_number())

corp_df_text

Overview of the Data Set

The data set demo_data/song-jay-amei-v1.csv is a collection of songs by two artists, Jay Chou and Amei Chang.

A quick frequency count of the songs by each artist in the data set:

corp_df_text %>%
  ggplot(aes(artist, fill=artist)) +
  geom_bar() 

Data Preprocessing (Cleaning)

Raw texts usually include a lot of noise, such as irrelevant symbols, punctuation, and redundant whitespace (e.g., line breaks, tabs). It is often suggested that we clean up the texts before tokenization.

## Define a function
normalize_document <- function(texts) {
  texts %>%
    str_replace_all("[\n\\p{C}]+", "\n") %>%  ## collapse control characters/line breaks into one line break
    str_replace_all("[ \u00a0\u3000\t]+", "") %>% ## remove spaces (incl. full-width) and tabs
    str_replace_all(" *\n *", "\n") ## remove whitespace around line breaks
}

## Apply cleaning to every document
corp_df_text$lyric <- normalize_document(corp_df_text$lyric)  





Initialize jiebaR

Because we use jiebaR for word tokenization, we first need to initialize the jiebaR models. Here we create two jiebaR objects: one for word tokenization only, and the other for part-of-speech (POS) tagging.

# initialize segmenter

## for word segmentation only
my_seg <- worker(bylines = T,
                 #user = "",
                 symbol = T)

## for POS tagging
my_seg_pos <- worker(
  type = "tag",
  bylines = F,
  #user = "",
  symbol = T
)

We can specify the path to the external user-defined dictionary in worker(..., user = "").
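As a quick sketch (the file name and dictionary entries here are hypothetical, not part of the demo data), a jiebaR user dictionary is a plain-text file with one word per line, optionally followed by a POS tag, and its path is passed to worker():

```r
library(jiebaR)

## hypothetical dictionary file "demo_data/user_dict.txt", one entry per line:
## 周杰倫 nr
## 張惠妹 nr

## initialize a segmenter that consults the user dictionary (path is an assumption)
my_seg_user <- worker(bylines = TRUE,
                      user = "demo_data/user_dict.txt",
                      symbol = TRUE)
```

Words listed in the user dictionary are always kept as whole tokens during segmentation.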

Alternatively, we can also add ad hoc new words to the jiebaR model. This can be very helpful when we spot weird segmentation results in the output.

By default, new_user_word() assigns each new word the default tag n.

#Add customized terms 
temp_new_words <-c("")
new_user_word(my_seg, temp_new_words)
[1] TRUE
new_user_word(my_seg_pos, temp_new_words)
[1] TRUE

Tidytext Framework

The following are examples of processing the Chinese texts under the tidy structure framework.

Recall the four important steps:

  • Load the corpus data (here, with read_csv()) and create a text-based data frame of the corpus;
  • Initialize a jieba word segmenter using worker();
  • Tokenize the text-based data frame into a line-based data frame using unnest_tokens();
  • Tokenize the line-based data frame into a word-based data frame using unnest_tokens().

## Line tokenization
corp_df_line <- corp_df_text %>%
  unnest_tokens(
    output = line, ## new unit name
    input = lyric, ## old unit name
    token = function (x)  ## tokenization method
      str_split(x, "\n+")
  ) %>%
  group_by(doc_id) %>%
  mutate(line_id = row_number()) %>%
  ungroup()

corp_df_line
## Word Tokenization
corp_df_word <- corp_df_line %>%
  unnest_tokens(
    output = word, ## new unit name
    input = line,  ## old unit name
    token = function(x)  ## tokenization method
      segment(x, jiebar = my_seg)
  ) %>%
  group_by(doc_id, line_id) %>% # line_id alone is not unique across documents
  mutate(word_id = row_number()) %>% # create word index within each line
  ungroup()

corp_df_word

Creating unique indices for your data is very important. In corpus linguistic analysis, we often need to keep track of the original context of the word, phrase or sentence in the concordances. All these unique indices (as well as the source text filenames) would make things a lot easier.

Also, if the metadata of the source documents are available, these unique indices would allow us to connect the tokenized linguistic units to the metadata information (e.g., genres, registers, author profiles) for more interesting analysis.
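As a minimal sketch of this idea (with toy data, not the corpus itself; in the tidyverse one would use dplyr::left_join() instead of base-R merge()), the doc_id index lets us join token-level rows back to document-level metadata:

```r
## toy word-based table and document-level metadata, linked by doc_id
words_df <- data.frame(doc_id = c(1, 1, 2),
                       word   = c("愛", "你", "快樂"))
meta_df  <- data.frame(doc_id = c(1, 2),
                       artist = c("Jay Chou", "Amei"))

## each word token now carries its document's metadata
merge(words_df, meta_df, by = "doc_id")
```

After the join, token-level counts can be grouped by any metadata column (e.g., artist).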

Therefore, after tokenization, we have obtained a line-based and a word-based data frame of our corpus data.

Case Study: Word Frequency and Wordcloud

With a word-based data frame, we can easily create a word frequency list as well as a word cloud to have a quick overview of the word distribution of the corpus.

It should be noted that before creating the frequency list, we often need to consider whether to remove unimportant tokens (e.g., stopwords, symbols, punctuation, digits, or alphanumeric tokens).

We can represent any character in Unicode in the form \uXXXX, where XXXX is the character's Unicode code point in hexadecimal format.

For example, can you tell which character \u6211 refers to? How about \u4f60?

The Unicode range [\u4E00-\u9FFF], used in the regular expression below, covers a set of frequently used Chinese characters. Therefore, the way we remove unimportant word tokens is to keep only those tokens containing characters that fall within this range.
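A quick base-R check of this idea: grepl() with the range matches strings containing common Chinese characters, and utf8ToInt() reveals a character's code point.

```r
## strings containing at least one character in the common CJK range
grepl("[\u4E00-\u9FFF]", c("愛", "abc", "123", "愛123"))
## [1]  TRUE FALSE FALSE  TRUE

## look up a character's Unicode code point in hexadecimal
sprintf("U+%04X", utf8ToInt("\u6211"))
## [1] "U+6211"
```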

For more information related to the Unicode range for the punctuation marks in CJK languages, please see this SO discussion thread.

## load chinese stopwords
stopwords_chi <- readLines("demo_data/stopwords-ch-jiebar-zht.txt", encoding = "UTF-8")

## create word freq list
corp_word_freq <- corp_df_word %>%
  filter(!word %in% stopwords_chi) %>% # remove stopwords
  filter(word %>% str_detect(pattern = "[\u4E00-\u9FFF]+")) %>% # keep only words containing Chinese characters
  count(word) %>%
  arrange(desc(n))

library(wordcloud2)
corp_word_freq %>%
  filter(n > 20) %>%
  filter(nchar(word) >= 2) %>% ## remove monosyllabic tokens
  wordcloud2(shape = "pentagon", size = 0.3)

Case Study: Patterns

In this case study, we look at a more complex example. In corpus linguistic analysis, we often need to extract a particular pattern from the texts. To retrieve the target patterns with high accuracy, we often need to make use of the additional annotations provided by the corpus. The most often-used information is the part-of-speech tags of words.

In this example, we will demonstrate how to enrich our corpus data by adding POS tags information to our current tidy corpus design.

Our steps are as follows:

  1. Initialize a jiebar object which performs not only word segmentation but also POS tagging;
  2. Create a self-defined function to word-segment and POS-tag each text, combining all tokens (word/tag) into one long string per text;
  3. With the line-based data frame corp_df_line, create a new column containing the enriched version of each text chunk, using mutate().

# define a function to word-seg and pos-tag a text fragment
tag_text <- function(x, jiebar) {
    segment(x, jiebar) %>% ## tokenize + POS-tagging
    paste(names(.), sep = "/", collapse = " ") ## reformat output
}

A quick example of the function’s usage:

# demo of the function `tag_text()`
tag_text(corp_df_line$line[10], 
         my_seg_pos)
[1] "貪/v 一點/m 愛/zg 什麼/r 痛/a 也/d 允許/v"
# apply `tag_text()` function to each text
corp_df_line <- corp_df_line %>%
              mutate(line_tag = map_chr(line, tag_text, my_seg_pos))

corp_df_line

Now that we have obtained an enriched version of all the texts, we can make use of the POS tags for more linguistic analyses.

For example, we can examine the use of adjectives in lyrics.

The data retrieval procedure is now very straightforward: we only need to create a regular expression that matches the pattern of interest and go through the enriched version of the texts (i.e., the line_tag column in corp_df_line) to identify the matches with unnest_tokens().

  1. Define a regular expression [^/\\s]+/a\\b for adjectives;

  2. Use unnest_tokens() and str_extract_all() to extract the target patterns and create a pattern-based data frame.

## define regex patterns
pat <- "[^/\\s]+/a\\b"

## extract patterns from corp
corp_df_pat <- corp_df_line %>%
  unnest_tokens(
    output = pat,  ## new unit name
    input = line_tag, ## old unit name
    token = function(x)  ## unnesting method
      str_extract_all(x, pattern = pat)
  )

## Data wrangling/cleaning
corp_df_pat_2 <- corp_df_pat %>%
  mutate(word = str_replace_all(pat, "/.+$", "")) %>%  ## strip the POS tag
  group_by(artist) %>% ## split df by artist
  count(word, sort = T) %>% ## create freq list for each artist
  top_n(20, n) %>%  ## select top 20 for each artist
  ungroup %>% ## merge df again
  arrange(artist, -n) ## sort result

## Visualization output
corp_df_pat_2 %>%
  mutate(word = reorder_within(word, n, artist)) %>%
  ggplot(aes(word, n, fill=artist)) +
  geom_bar(stat="identity")+ coord_flip()+
  facet_wrap(~artist,scales = "free_y") +
  scale_x_reordered() +
  labs(x = "Adjectives", y = "Frequency",
       title = "Top 20 Adjectives of Each Artist's Songs")

Case Study: Lexical Bundles

N-grams Extraction

With word boundaries, we can also analyze the recurrent multiword units in the corpus. In this example, let’s take a look at the recurrent four-word sequences (i.e., four-grams) in our corpus.

As the default n-gram tokenization in unnest_tokens(..., token = "ngrams") only works for English data, we need to define our own ngram tokenization function.

The Chinese ngram tokenization function should:

  • Tokenize each text fragment (i.e., lines) into word tokens
  • Create a set of ngrams from the word tokens of each text
## self defined ngram tokenizer
tokenizer_ngrams <-
  function(texts,
           jiebar,
           n = 2,
           skip = 0,
           delimiter = "_") {
    
    texts %>% ## given a vector of lines/chunks
      segment(jiebar) %>% ## word tokenization 
      as.tokens %>% ## list to tokens
      tokens_ngrams(n, skip, concatenator = delimiter) %>%  ## ngram tokenization
      as.list ## tokens to list
  }

In the above self-defined ngram tokenizer, we make use of tokens_ngrams() in quanteda, which creates a set of ngrams from already tokenized text objects, i.e., tokens. Because this function requires a tokens object as the input, we need to do the class conversion via as.tokens() and as.list().

Take a look at the following examples for a quick overview of tokens_ngrams():

sents <- c("Jack and Jill went up the hill to fetch a pail of water",
           "Jack fell down and broke his crown and Jill came tumbling after")

sents_tokens <- tokens(sents) ## English tokenization
tokens_ngrams(sents_tokens, n = 2, skip = 0)
Tokens consisting of 2 documents.
text1 :
 [1] "Jack_and"  "and_Jill"  "Jill_went" "went_up"   "up_the"    "the_hill" 
 [7] "hill_to"   "to_fetch"  "fetch_a"   "a_pail"    "pail_of"   "of_water" 

text2 :
 [1] "Jack_fell"      "fell_down"      "down_and"       "and_broke"     
 [5] "broke_his"      "his_crown"      "crown_and"      "and_Jill"      
 [9] "Jill_came"      "came_tumbling"  "tumbling_after"
tokens_ngrams(sents_tokens, n = 2, skip = 1)
Tokens consisting of 2 documents.
text1 :
 [1] "Jack_Jill"  "and_went"   "Jill_up"    "went_the"   "up_hill"   
 [6] "the_to"     "hill_fetch" "to_a"       "fetch_pail" "a_of"      
[11] "pail_water"

text2 :
 [1] "Jack_down"     "fell_and"      "down_broke"    "and_his"      
 [5] "broke_crown"   "his_and"       "crown_Jill"    "and_came"     
 [9] "Jill_tumbling" "came_after"   

A quick example of how to use the self-defined function tokenizer_ngrams():

# examples
texts <- c("這是一個測試的句子",
           "這句子",
           "超短句",
           "最後一個超長的句子測試")

tokenizer_ngrams(
  texts = texts,
  jiebar = my_seg,
  n = 2,
  skip = 0, 
  delimiter = "_"
)
$text1
[1] "這是_一個" "一個_測試" "測試_的"   "的_句子"  

$text2
[1] "這_句子"

$text3
[1] "超短_句"

$text4
[1] "最後_一個" "一個_超長" "超長_的"   "的_句子"   "句子_測試"
tokenizer_ngrams(
  texts = texts,
  jiebar = my_seg,
  n = 2,
  skip = 1, 
  delimiter = "_"
)
$text1
[1] "這是_測試" "一個_的"   "測試_句子"

$text2
character(0)

$text3
character(0)

$text4
[1] "最後_超長" "一個_的"   "超長_句子" "的_測試"  
tokenizer_ngrams(
  texts = texts,
  jiebar = my_seg,
  n = 5,
  skip=0,
  delimiter = "_"
)
$text1
[1] "這是_一個_測試_的_句子"

$text2
character(0)

$text3
character(0)

$text4
[1] "最後_一個_超長_的_句子" "一個_超長_的_句子_測試"

With the self-defined ngram tokenizer, we can now perform the ngram tokenization on our corpus. We will use the line-based data frame (corp_df_line) as our starting point:

  1. We transform the line-based data frame into an ngram-based data frame using unnest_tokens(...) with the self-defined tokenization function tokenizer_ngrams()

  2. We remove empty and unwanted n-gram entries:

    • Empty ngrams due to short texts
    • Ngrams spanning punctuation, symbols, or paragraph breaks
    • Ngrams including alphanumeric characters
## from line-based to ngram-based
corp_df_ngram <- corp_df_line %>%
  unnest_tokens(
    ngram, ## new unit name
    line,  ## old unit name
    token = function(x) ## unnesting method
      tokenizer_ngrams(
        texts = x,
        jiebar = my_seg,
        n = 4,
        skip = 0,
        delimiter = "_"
      )
  )
Warning: Outer names are only allowed for unnamed scalar atomic inputs
## remove unwanted ngrams
corp_df_ngram_2 <- corp_df_ngram %>%
  filter(nzchar(ngram)) %>% ## remove empty strings
  filter(!str_detect(ngram, "[^\u4E00-\u9FFF_]")) ## keep only ngrams of Chinese characters

Frequency and Dispersion

A multiword unit can be defined based on at least two important distributional properties (See Biber, Conrad, and Cortes (2004)):

  • The frequency of the whole multiword unit (i.e., frequency)
  • The number of different texts where the multiword unit is observed (i.e., dispersion)
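The two measures can be illustrated with a minimal base-R sketch (toy data, not the corpus): frequency counts every occurrence, while dispersion counts only the distinct documents in which an ngram appears.

```r
## toy ngram table: the same ngram may occur several times in one document
toy <- data.frame(doc_id = c(1, 1, 2, 3, 3, 3),
                  ngram  = c("a_b", "a_b", "a_b", "c_d", "c_d", "a_b"))

## frequency: total number of occurrences of each ngram
freq <- table(toy$ngram)

## dispersion: number of distinct documents containing each ngram
disp <- tapply(toy$doc_id, toy$ngram, function(d) length(unique(d)))

freq["a_b"]  ## occurs 4 times in total
disp["a_b"]  ## but in only 3 different documents
```

Note that "c_d" occurs twice but only in one document, so it has the lower dispersion.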

Now that we have the ngram-based data frame, we can compute their token frequencies and document frequencies in the corpus using the normal data manipulation tricks.

We set the cut-off for four-grams at: dispersion >= 3 (i.e., four-grams that occur in at least three different documents)

corp_ngram_dist <- corp_df_ngram_2 %>%
  group_by(ngram) %>%
  summarize(freq = n(), dispersion = n_distinct(doc_id)) %>%
  filter(dispersion >= 3)

Please take a look at the four-grams, arranged by frequency and dispersion respectively:

# arrange by dispersion
corp_ngram_dist %>%
  arrange(desc(dispersion)) %>% head(10)
# arrange by freq
corp_ngram_dist %>%
  arrange(desc(freq)) %>% head(10)

We can also look at four-grams with particular lexical words:

corp_ngram_dist %>%
  filter(str_detect(ngram, "我")) %>%
  arrange(desc(dispersion))
corp_ngram_dist %>%
  filter(str_detect(ngram, "你")) %>%
  arrange(desc(dispersion))

Quanteda Framework

All the above examples demonstrate Chinese data processing with the tidytext framework. Here, we look at a few more examples of processing the data with the quanteda framework.

For Chinese data, the most important base unit in Quanteda is the tokens object. So first we need to create the tokens object based on the jiebaR tokenization method.

## create tokens based on self-defined segmentation
corp_tokens <- corp_df_text$lyric %>%
  map(str_split,"\n+", simplify=TRUE) %>% ## line tokenization
  map(segment, my_seg) %>% ## word segmentation
  map(unlist) %>% ## reformat structure
  as.tokens ## list to tokens

## add document-level metadata 
docvars(corp_tokens) <- corp_df_text[, c("artist","lyricist","composer","title","gender")]

Case Study: Concordances with kwic()

This is an example of processing the Chinese data under the quanteda framework.

Without relying on the Quanteda-native tokenization, we have created the tokens object directly based on the output of segment().

With this tokens object, we can perform the concordance analysis with kwic().

kwic(corp_tokens, "快樂")

Because we have also added the document-level information, we can utilize this metadata and perform more interesting analysis.

For example, we can examine the concordance lines of a keyword in a subset of the corpus:

corp_tokens_subset <-tokens_subset(corp_tokens, 
                                   str_detect(lyricist, "方文山"))
textplot_xray(
  kwic(corp_tokens_subset, "快樂"),
  kwic(corp_tokens_subset, "難過"))

Case Study: Comparison Word Cloud

In quanteda, we can quickly create a comparison cloud, showing the differences in lexical distributions across corpus subsets.

corp_tokens %>% 
  dfm() %>%
  dfm_remove(pattern= stopwords_chi, ## remove stopwords
             valuetype="fixed") %>% 
  dfm_keep(pattern = "[\u4E00-\u9FFF]+",  ## include freq chinese char
           valuetype= "regex") %>% 
  dfm_group (groups = artist) %>% ## group by artist
  dfm_trim(min_termfreq = 10, ## distributional cutoffs
           min_docfreq = 2,
           verbose = F) %>%
  textplot_wordcloud(comparison=TRUE,
                     min_size = 0.8,
                     max_size = 4)

Case Study: Collocations

## extract collocations
corp_collocations <- corp_tokens %>%
  tokens_keep(pattern = "[\u4E00-\u9FFF]+",  ## include freq chinese char
              valuetype= "regex") %>%
  textstat_collocations(
    size = 2, 
    min_count = 10)


top_n(corp_collocations, 20, z)

Recap

Tokenization is complex in Chinese text processing. Many factors need to be taken into account when determining the right tokenization method. In particular, several questions are relevant to Chinese text tokenization:

  1. Do you need the parts-of-speech tags of words in your research?
  2. What is the base unit you would like to work with? Texts? Paragraphs? Chunks? Sentences? N-grams? Words?
  3. Do you need non-word tokens such as symbols, punctuation, digits, or alphabetic characters in your analysis?

Your answers to the above questions should help you determine the most effective tokenization strategy for your data.


Thanks for listening!

References

Biber, Douglas, Susan Conrad, and Viviana Cortes. 2004. “If You Look at…: Lexical Bundles in University Teaching and Textbooks.” Applied Linguistics 25 (3): 371–405.